# Multimodal Q&A
## LLaVA-1.5-7B-HF Q4_K_M GGUF
A GGUF-format conversion of llava-hf/llava-1.5-7b-hf, supporting image-to-text generation tasks.
Tags: Image-to-Text, English · Author: Marwan02 · Downloads: 30 · Likes: 1

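GGUF conversions like this one are typically run with llama.cpp or its Python bindings. Below is a minimal sketch using llama-cpp-python; the local file names are placeholders (the listing does not name the shipped files), and a LLaVA GGUF needs the separate mmproj vision-projector file alongside the quantized language model.

```python
# Minimal sketch: running a LLaVA-1.5 GGUF with llama-cpp-python.
# File names are placeholders; a LLaVA build ships the quantized language
# model and a separate mmproj (vision projector) file.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="llava-1.5-7b-hf.Q4_K_M.gguf",  # hypothetical local file name
    chat_handler=chat_handler,
    n_ctx=2048,  # leave room for the image embedding plus the answer
)
response = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            # An https URL or a base64 data: URI both work here.
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }]
)
print(response["choices"][0]["message"]["content"])
```
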
## VL-Rethinker-7B MLX 4-bit
License: Apache-2.0
A 4-bit MLX-quantized variant of TIGER-Lab/VL-Rethinker-7B, optimized for Apple devices and supporting visual question-answering tasks.
Tags: Image-to-Text, English · Author: TheCluster · Downloads: 14 · Likes: 0

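On Apple silicon, MLX checkpoints like this one are usually driven through the mlx-vlm package. A minimal sketch under that assumption follows; the repo id is inferred from the listing rather than confirmed, and generate()'s argument order has shifted between mlx-vlm releases, so check your installed version.

```python
# Minimal sketch, assuming the mlx-vlm package (pip install mlx-vlm).
# The repo id below is inferred from this listing and may not match the
# actual hub path; generate()'s signature has changed across releases.
from mlx_vlm import load, generate

model, processor = load("TheCluster/VL-Rethinker-7B-mlx-4bit")  # hypothetical id
prompt = "What is happening in this image?"
output = generate(model, processor, prompt, image="photo.jpg")
print(output)
```
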
## VL-Rethinker-7B FP16
License: Apache-2.0
A multimodal vision-language model converted from Qwen2.5-VL-7B-Instruct, supporting visual question-answering tasks.
Tags: Image-to-Text, Transformers, English · Author: mlx-community · Downloads: 17 · Likes: 0

## VL-Rethinker-72B 8-bit
License: Apache-2.0
An 8-bit quantization of the VL-Rethinker-72B multimodal vision-language model (derived from Qwen2.5-VL-72B-Instruct), suitable for visual question-answering tasks.
Tags: Image-to-Text, Transformers, English · Author: mlx-community · Downloads: 18 · Likes: 0

## VL-Rethinker-72B 4-bit
License: Apache-2.0
VL-Rethinker-72B-4bit is a multimodal model derived from Qwen2.5-VL-72B-Instruct, supporting visual question-answering tasks and converted to MLX format for efficient operation on Apple devices.
Tags: Image-to-Text, Transformers, English · Author: mlx-community · Downloads: 26 · Likes: 0

## Gemma-3-4B-It-Abliterated Q4_0 GGUF
A GGUF-format conversion of mlabonne/gemma-3-4b-it-abliterated, combined with the vision component of x-ray_alpha for a smoother multimodal experience.
Tags: Image-to-Text · Author: BernTheCreator · Downloads: 160 · Likes: 1

## LLaVAction-7B
LLaVAction is a framework for evaluating and training multimodal large language models for action recognition; it is built on the Qwen2 language-model architecture and supports first-person (egocentric) video understanding.
Tags: Video-to-Text, Transformers, English · Author: MLAdaptiveIntelligence · Downloads: 149 · Likes: 1

## VideoChat-Flash Qwen2.5-7B InternVideo2-1B
License: Apache-2.0
A multimodal video-text model built on InternVideo2-1B and Qwen2.5-7B, using only 16 tokens per frame and supporting input sequences of up to 10,000 frames.
Tags: Video-to-Text, Transformers, English · Author: OpenGVLab · Downloads: 193 · Likes: 4

## Asagi-8B
License: Apache-2.0
Asagi-8B is a large-scale Japanese vision-language model (VLM) trained on extensive Japanese datasets drawn from diverse sources.
Tags: Image-to-Text, Transformers, Japanese · Author: MIL-UT · Downloads: 58 · Likes: 4

## EraX-VL-7B-V2.0-Preview i1 GGUF
License: Apache-2.0
Weighted/importance-matrix (imatrix) quantizations of the EraX-VL-7B-V2.0-Preview model, offered in multiple quantization variants to suit different needs.
Tags: Image-to-Text, Multilingual · Author: mradermacher · Downloads: 246 · Likes: 1

## VideoChat-Flash Qwen2.5-2B Res448
License: Apache-2.0
VideoChat-Flash-2B is a multimodal model built on UMT-L (300M) and Qwen2.5-1.5B, supporting video-to-text tasks with only 16 tokens per frame and a context window extended to 128k.
Tags: Video-to-Text, Transformers, English · Author: OpenGVLab · Downloads: 904 · Likes: 18

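The frame-token figures quoted for the two VideoChat-Flash entries above are easy to sanity-check with a little arithmetic:

```python
# Back-of-the-envelope check of the VideoChat-Flash numbers above.
TOKENS_PER_FRAME = 16
CONTEXT_WINDOW = 128_000  # the extended 128k window of the 2B variant

# Frames that fit before any text tokens are spent:
print(CONTEXT_WINDOW // TOKENS_PER_FRAME)  # 8000

# Visual-token cost of the 10,000-frame sequences cited for the 7B variant:
print(10_000 * TOKENS_PER_FRAME)  # 160000
```
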
## EraX-VL-7B-V2.0-Preview
License: Apache-2.0
A powerful multimodal model designed for OCR and visual question answering. It handles multiple languages, Vietnamese in particular, and performs strongly at recognizing medical forms, invoices, and other documents.
Tags: Image-to-Text, Transformers, Multilingual · Author: erax-ai · Downloads: 476 · Likes: 22

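Since EraX-VL-7B derives from the Qwen2-VL family, it should load through the standard Qwen2-VL recipe in Transformers. The sketch below assumes exactly that, plus the companion qwen_vl_utils helper package; verify both, and the repo id, against the model card.

```python
# Minimal OCR/VQA sketch, assuming the standard Qwen2-VL loading recipe.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # companion helper package

model_id = "erax-ai/EraX-VL-7B-V2.0-Preview"  # assumed hub id
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "invoice.jpg"},
        {"type": "text", "text": "Extract the invoice number and the total amount."},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, padding=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```
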
## MMAlaya2
License: Apache-2.0
A multimodal model fine-tuned from InternVL-Chat-V1-5 that performs strongly on the MMBench benchmark.
Tags: Image-to-Text · Author: DataCanvas · Downloads: 26 · Likes: 2

## Idefics2-8B-Chatty
License: Apache-2.0
Idefics2 is an open multimodal model that accepts arbitrary sequences of images and text as input and generates text output. It can answer questions about images, describe visual content, create stories grounded in multiple images, or operate as a pure language model.
Tags: Image-to-Text, Transformers, English · Author: HuggingFaceM4 · Downloads: 617 · Likes: 94

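Idefics2 uses the standard Transformers vision-to-sequence interface. A minimal single-image Q&A sketch follows; the image URL is a placeholder.

```python
# Minimal single-image Q&A sketch for Idefics2 via Transformers.
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

model_id = "HuggingFaceM4/idefics2-8b-chatty"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

image = load_image("https://example.com/photo.jpg")  # placeholder URL
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```
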
## Heron-Chat-GIT Ja-StableLM-Base-7B v1
A vision-language model that can converse about input images, with support for Japanese interaction.
Tags: Image-to-Text, Transformers, Japanese · Author: turing-motors · Downloads: 54 · Likes: 2

## ChatTruth-7B
ChatTruth-7B is a multilingual vision-language model built on the Qwen-VL architecture, enhanced with high-resolution image processing and a restoration module that reduces computational overhead.
Tags: Image-to-Text, Transformers, Multilingual · Author: mingdali · Downloads: 73 · Likes: 13

## Heron-Chat-GIT Ja-StableLM-Base-7B v0
Heron GIT Japanese StableLM Base 7B is a vision-language model capable of conversing about input images.
Tags: Image-to-Text, Transformers, Japanese · Author: turing-motors · Downloads: 57 · Likes: 1

## IDEFICS-9B
License: Other
IDEFICS is an open-source multimodal model that processes both image and text inputs to generate text outputs; it is an open reproduction of DeepMind's Flamingo model.
Tags: Image-to-Text, Transformers, English · Author: HuggingFaceM4 · Downloads: 3,676 · Likes: 46

## Donut-RefExp-Combined-V1
A visual question-answering model focused on understanding user-interface referring expressions.
Tags: Image-to-Text, Transformers, English · Author: ivelin · Downloads: 503 · Likes: 4

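Donut-style checkpoints are queried by feeding the decoder a task-specific prompt prefix. The sketch below uses the standard VisionEncoderDecoder recipe; the exact prompt tokens for this checkpoint are an assumption, so confirm them on the model card.

```python
# Minimal sketch of a Donut-style UI referring-expression query.
# The task-prompt tokens below are hypothetical; Donut checkpoints each
# define their own prefix, so check the model card for the real format.
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

model_id = "ivelin/donut-refexp-combined-v1"  # assumed hub id
processor = DonutProcessor.from_pretrained(model_id)
model = VisionEncoderDecoderModel.from_pretrained(model_id)

image = Image.open("screenshot.png").convert("RGB")
task_prompt = "<s_refexp><s_prompt>select the search button</s_prompt>"  # hypothetical

pixel_values = processor(image, return_tensors="pt").pixel_values
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids
outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
)
print(processor.batch_decode(outputs, skip_special_tokens=False)[0])
```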